Welcome everybody to deep learning. Today we want to look into further common practices, and in particular in this video we want to discuss architecture selection and hyperparameter optimization.
Remember: the test data is still in the vault; we are not touching it.
However, we need to set our hyperparameters somehow, and you have already seen that there is an enormous number of them. For the architecture: the number of layers, the number of nodes per layer, and the activation functions. For the optimization: the initialization, the loss function, the optimizer, such as stochastic gradient descent, momentum, or Adam, the learning rate and its decay, and the batch size. For regularization: the different regularizers such as the L2 and L1 penalties, batch normalization, dropout and so on.
And you want to somehow figure out all the parameters for those different kinds of procedures.
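To make this concrete, here is a minimal sketch of how such a search space might be written down; every name and value range below is an illustrative choice, not a recommendation from the lecture.

```python
# Illustrative hyperparameter search space; all names and ranges
# are example choices for a generic classification setup.
search_space = {
    # architecture
    "num_layers":      [2, 4, 8],
    "nodes_per_layer": [64, 128, 256],
    "activation":      ["relu", "tanh"],
    # optimization
    "optimizer":       ["sgd", "momentum", "adam"],
    "learning_rate":   [1e-1, 1e-2, 1e-3],  # on a log scale
    "lr_decay":        [0.0, 1e-2, 1e-1],
    "batch_size":      [32, 64, 128],
    # regularization
    "l2_weight":       [0.0, 1e-4, 1e-3],
    "dropout":         [0.0, 0.25, 0.5],
    "batch_norm":      [True, False],
}
```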
Now let's choose the architecture and loss function. The first step is to think about the problem and the data: What could the features look like? What kind of spatial correlation do you expect? What data augmentation makes sense? How will the classes be distributed? What is important regarding the target application? Then start with simple architectures and loss functions, and of course do your research. Try well-known models first and foremost: there are so many published papers out there that there is no need to do everything by yourself. One day in the library can save weeks and months of experimentation. Do the research, it will really save you time. The very good papers don't just report the scientific results; they also share source code, sometimes even data. Try to find those papers; they can help you a lot with your own experimentation.
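As an illustration of "start simple", here is a small baseline one might begin with; the framework (PyTorch), layer sizes, and loss are example choices for a generic classification problem, not prescriptions from the lecture.

```python
import torch.nn as nn

# A deliberately simple baseline: a small fully connected network.
# All sizes are example choices (e.g., MNIST-shaped inputs, 10 classes).
baseline = nn.Sequential(
    nn.Flatten(),
    nn.Linear(28 * 28, 128),  # input features -> hidden layer
    nn.ReLU(),
    nn.Linear(128, 10),       # hidden layer -> class scores
)
loss_fn = nn.CrossEntropyLoss()
```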
Then you may want to adapt an architecture found in the literature, and if you change something, find good reasons why it is an appropriate change. There are quite a few papers out there that seem to introduce random changes into the architecture, and it later turns out that the observations they made were essentially random: the authors were just lucky, or experimented enough on their own data to get the improvements. Typically there should also be a reasonable argument for why a specific change gives an improvement in performance.
Next, you want to do your hyperparameter search. Remember: learning rate, decay, regularization, dropout and so on all have to be tuned, yet the networks can take days or weeks to train. For searching these hyperparameters we recommend using a log scale; for example, for the learning rate η you would try 0.1, 0.01, 0.001. You may want to consider grid search or random search. In grid search you take equally spaced steps, but reference [2] has shown that random search has real advantages over grid search: first, it is easier to implement, and second, it explores the parameters that have a strong influence on the result better.
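As a sketch, random search over a log-scaled learning rate might look like the following; `train_and_validate` is a hypothetical placeholder for your own training routine.

```python
import random

def train_and_validate(config):
    # Hypothetical placeholder: train for a few epochs with `config`
    # and return the validation accuracy. Dummy value for illustration.
    return random.random()

def sample_config():
    """Draw one random configuration; the learning rate is sampled
    log-uniformly, as recommended above."""
    return {
        "learning_rate": 10 ** random.uniform(-4, -1),  # 1e-4 .. 1e-1
        "dropout":       random.uniform(0.0, 0.5),
        "batch_size":    random.choice([32, 64, 128]),
    }

best_config, best_score = None, float("-inf")
for _ in range(20):                    # e.g., 20 random trials
    config = sample_config()
    score = train_and_validate(config)
    if score > best_score:
        best_config, best_score = config, score
print(best_config, best_score)
```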
So you may want to look into that and then adjust your strategy accordingly. Hyperparameters are highly interdependent, so you may want to use a coarse-to-fine search: you optimize on a very coarse scale in the beginning and then make the search finer and finer. You may train the network for only a few epochs per trial at first, bring all the hyperparameters into sensible ranges, and then refine using random or grid search, as in the sketch below.
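A minimal sketch of this coarse-to-fine idea, again with a hypothetical `train_and_validate` placeholder: first a wide, cheap search with few epochs per trial, then a narrow search around the best coarse result with longer training.

```python
import math
import random

def train_and_validate(lr, epochs):
    # Hypothetical placeholder: train for `epochs` epochs at learning
    # rate `lr` and return the validation accuracy. Dummy value here.
    return random.random()

def random_lr_search(lo_exp, hi_exp, trials, epochs):
    """Random search over learning rates drawn log-uniformly
    from 10**lo_exp to 10**hi_exp."""
    best_lr, best_score = None, float("-inf")
    for _ in range(trials):
        lr = 10 ** random.uniform(lo_exp, hi_exp)
        score = train_and_validate(lr, epochs)
        if score > best_score:
            best_lr, best_score = lr, score
    return best_lr

# Coarse stage: wide range, only a few epochs per trial.
coarse_lr = random_lr_search(lo_exp=-6, hi_exp=-1, trials=30, epochs=2)

# Fine stage: narrow range around the coarse optimum, longer training.
center = math.log10(coarse_lr)
fine_lr = random_lr_search(lo_exp=center - 0.5, hi_exp=center + 0.5,
                           trials=10, epochs=20)
```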
A very common technique that can give you a little extra bit of performance is ensembling; this can really help you get that additional boost you still need. So far we have only considered a single classifier, but the idea of ensembling is to use many of those classifiers together. If you assume $n$ independent classifiers, each making a correct prediction with probability $1 - p$, then the probability of seeing exactly $k$ errors is $\binom{n}{k} p^k (1 - p)^{n - k}$, i.e., a binomial distribution. So the probability of the majority, meaning more than $n/2$ of the classifiers, being wrong is $\sum_{k > n/2} \binom{n}{k} p^k (1 - p)^{n - k}$. We visualize this in the following plot: if you take more of those weak classifiers and set, for example, their individual error probability $p$ below one half, the probability that the majority is wrong shrinks rapidly as $n$ grows.
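To reproduce the numbers behind such a plot, here is a small sketch that evaluates the majority-vote error for a growing ensemble; the per-classifier error $p = 0.3$ is just an example value.

```python
from math import comb

def majority_error(n, p):
    """Probability that more than half of n independent classifiers,
    each wrong with probability p, err at the same time."""
    return sum(comb(n, k) * p**k * (1 - p)**(n - k)
               for k in range(n // 2 + 1, n + 1))

# Example: each classifier is wrong 30% of the time (p = 0.3).
for n in [1, 5, 11, 21, 51]:
    print(f"n = {n:2d}: P(majority wrong) = {majority_error(n, 0.3):.4f}")
```

With $p < 0.5$, this probability falls quickly as $n$ grows, which is exactly the effect the plot illustrates.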